Abstract

In this paper we investigate mathematical questions concerning the reliability (reconstruction
accuracy) of Fitch's maximum parsimony algorithm for reconstructing the ancestral
state given a phylogenetic tree and a character. In particular, we consider the question 
whether the maximum parsimony method applied to a subset of taxa can reconstruct the 
ancestral state of the root more accurately than when applied to all taxa, and we give an 
example showing that this indeed is possible. A surprising feature of our example is that 
ignoring a taxon closer to the root improves the reliability of the method. On the other 
hand, in the case of the two-state symmetric substitution model, we answer affirmatively 
a conjecture of Li, Steel and Zhang which states that under a molecular clock the prob- 
ability that the state at a single taxon is a correct guess of the ancestral state is a lower 
bound on the reconstruction accuracy of Fitch's method applied to all taxa. 
In a recent study (Li et al., 2008), a likelihood analysis of Fitch's maximum parsimony
method (Fitch, 1971) (which we call MP for short) for the reconstruction of the ancestral
state at the root was conducted. It was shown that in a rooted phylogenetic tree if a 
leaf (taxon) is closer to the root than all the other leaves, then the character state at this 
leaf may sometimes be a more accurate guess of the ancestral state than the ancestral 
state constructed by MP applied to all taxa. The authors also provided an example of 
a phylogenetic tree for which MP for the reconstruction of the root state works more 
reliably on a subset of taxa closer to the root than on all taxa. 

Generally the root state is more likely to be conserved on taxa that are nearer to the 
root than on taxa that are farther away. Therefore, it is not surprising that on some trees 
the root state can be more reliably estimated by looking at only taxa nearer to the root. 
But can the reconstruction accuracy of MP improve when a taxon or a subset of taxa 
close to the root is ignored? We present a surprising example of a tree on which MP on
a subset of taxa is more likely to reconstruct the correct ancestral state than MP on all taxa. In our example,
the reconstruction accuracy improves when we ignore a taxon close to the root from our 
analysis. Moreover, the ignored taxon may be arbitrarily close to the root compared to 
the taxa that are not ignored. On the other hand, we show that under a molecular clock, 
considering a single taxon is never better than considering all taxa for the purpose of 
ancestral state reconstruction. Our analysis partially resolves a conjecture of Li, Steel 
and Zhang. They conjectured that under a molecular clock, maximum parsimony on all 
taxa is expected to generally perform at least as well (in the sense of the reconstruction
accuracy) as reconstructing the ancestral state based on the character state at a single 
taxon. We make the conjecture precise and answer it affirmatively for the case of the 
two-state symmetric model. 



Maximum Parsimony on Subsets of Taxa 

We start with some notation. Let $T$ be a rooted binary phylogenetic tree on the leaf set
(i.e., the set of taxa) $X$. Let the root of the tree be $\rho$. We assume that each vertex in
$T$ takes one of the two states $\alpha$ and $\beta$. The states evolve from the root state under a
simple symmetric model of substitution described as follows. Suppose that $e = (u, v)$ is
an edge of the tree $T$, where $u$ is the vertex closer to the root than $v$ is. Let $p_e$ be the
substitution probability on edge $e$: it is the probability that $v$ is in state $\beta$ conditional
on $u$ being in state $\alpha$, and is denoted by $P(v = \beta \mid u = \alpha)$. The model is assumed to be
symmetric, therefore, $p_e = P(v = \beta \mid u = \alpha) = P(v = \alpha \mid u = \beta)$. Moreover we assume
[...] discussed elsewhere in the literature (for example, [...] 1978) [...]
a special case of the well known symmetric r-state model (see, e.g., Tuffley and Steel,
1997). A binary character is an assignment of one of the two possible states $\alpha$ and $\beta$ to
each leaf of the tree, that is, it is a map $f : X \to \{\alpha, \beta\}$.

In this section we analyze the probability that maximum parsimony applied to a subset
of the set of taxa correctly estimates the true state at the root. Suppose that $Y$ is a subset
of $X$ (denoted $Y \subseteq X$). It induces a subtree $T_Y$, rooted at a vertex $y$. Here $y$ is the most
recent common ancestor of the vertices in $Y$. It is possible that $y = \rho$. Let $f_Y$ denote the
restriction of a binary character $f$ to $Y$. MP assigns states $\alpha$ or $\beta$ to all internal nodes
(including the root $\rho$) so that the total number of substitutions is minimized. Such an
assignment is not necessarily unique: MP computes a set $S_z$ of possible states at each
internal vertex $z$, so that each most parsimonious assignment must assign one of the
states in $S_z$ to the vertex $z$. When MP is applied to a binary character $f$, we have either
$S_\rho = \{\alpha\}$ or $S_\rho = \{\beta\}$ or $S_\rho = \{\alpha, \beta\}$ at the root $\rho$. If $S_\rho$ is either $\{\alpha\}$ or $\{\beta\}$, then we
say that MP unambiguously reconstructs the root state; otherwise (when $S_\rho$ is $\{\alpha, \beta\}$) we
say that MP ambiguously reconstructs the root state.

The maximum parsimony algorithm may also be applied to $f_Y$ on the subtree $T_Y$. It
returns a state set $S_y = \{\alpha\}$ or $S_y = \{\beta\}$ or $S_y = \{\alpha, \beta\}$ for the root $y$ of $T_Y$.

In the following, we will denote by $\mathrm{MP}(f, T)$ the set of character states chosen by Fitch's
maximum parsimony algorithm as possible root states when applied to a character $f$ on
a tree $T$.
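
To make the definition of $\mathrm{MP}(f, T)$ concrete, the following minimal Python sketch computes the root state set by Fitch's bottom-up pass. The nested-tuple tree encoding, the leaf names and the function name `fitch_root_set` are illustrative assumptions and do not appear in the original analysis.

```python
from typing import Dict, FrozenSet, Tuple, Union

# Assumed encoding: a tree is either a leaf name (str) or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_root_set(tree: Tree, character: Dict[str, str]) -> FrozenSet[str]:
    """Return the state set assigned to the root by Fitch's bottom-up pass."""
    if isinstance(tree, str):
        # A leaf carries exactly the state given by the character.
        return frozenset({character[tree]})
    left, right = tree
    s_left = fitch_root_set(left, character)
    s_right = fitch_root_set(right, character)
    common = s_left & s_right
    # Intersection if it is non-empty, otherwise the union of the child sets.
    return common if common else s_left | s_right


tree = (("x1", "x2"), ("x3", "x4"))
# Unambiguous reconstruction: the root set is {a}.
print(fitch_root_set(tree, {"x1": "a", "x2": "a", "x3": "a", "x4": "b"}))
# Ambiguous reconstruction: the root set is {a, b}.
print(fitch_root_set(tree, {"x1": "a", "x2": "a", "x3": "b", "x4": "b"}))
```

Restricting the character and the tree to a subset $Y$ and calling the same function on $T_Y$ gives $\mathrm{MP}(f_Y, T_Y)$.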

Li, Steel and Zhang defined the unambiguous reconstruction accuracy $UA(Y)$ and the
ambiguous reconstruction accuracy $AA(Y)$ as follows:

$$UA(Y) := P(\mathrm{MP}(f_Y, T_Y) = \{\alpha\} \mid \rho = \alpha),$$

$$AA(Y) := P(\mathrm{MP}(f_Y, T_Y) = \{\alpha, \beta\} \mid \rho = \alpha).$$



In other words, $UA(Y)$ is the probability that the root state $\alpha$ evolves to a character
$f$ for which maximum parsimony on $Y$ assigns the state set $\{\alpha\}$ to the root $y$ of $T_Y$.
Furthermore, they defined the reconstruction accuracy as

$$RA(Y) := UA(Y) + \frac{1}{2} AA(Y), \tag{1}$$

where the second term indicates that when MP reconstructs the state at the root
ambiguously, we select one of the states with equal probability.

Note that MP, when applied to $Y$, estimates a state at the root vertex $y$ of the subtree
$T_Y$ induced by $Y$. Since it is possible that the root $y$ of $T_Y$ is different from the root $\rho$ of
$T$, we define the reconstructed state at $y$ to be the estimate of the state at the root based
on the subset $Y$ of taxa.

Li, Steel and Zhang gave an example of a tree for which the reconstruction accuracy
of MP on a proper subset of taxa is higher than the reconstruction accuracy of MP on all
taxa, i.e., $RA(Y) > RA(X)$ for some proper subset $Y$ of $X$. But their example requires
that the taxa in $Y$ are closer to the root than the taxa not in $Y$, i.e., that the probability
of a substitution from the root to any taxon in $Y$ is smaller than the probability of
a substitution from the root to the other taxa. The example that we present in the
following subsection does not require any taxa to be closer to the root. On the contrary,
our example shows that a misleading taxon or taxa (a taxon or taxa that have an adverse
effect on the reconstruction accuracy) may be arbitrarily close to the root.




An Example of a Misleading Taxon

The main result of this section is the following theorem which shows that there are trees 
on which the reconstruction accuracy improves when a taxon close to the root is ignored 
in an MP based ancestral state reconstruction. Moreover, such a misleading taxon may 
be arbitrarily close to the root. 

Theorem 1. Let $p_z$ be any real number such that $0 < p_z < 1/2$. Then there exists a binary
phylogenetic tree $T$ on a leaf set $X$ and rooted at $\rho$ such that the following conditions are
satisfied:

1. for some leaf $z$, the substitution probability from $\rho$ to $z$ is $p_z$;

2. $RA(X \setminus \{z\}) > RA(X)$; and

3. for each leaf $v \neq z$, the substitution probability $p_v$ from $\rho$ to $v$ is more than $p_z$, i.e.,
$z$ is closer to the root than any other taxon.

To prove the above theorem, we first need some notation and a lemma. Let $y$ be a
vertex in a binary phylogenetic tree $T$, and let $Y$ be the set of leaves below $y$. We associate
three probabilities with $Y$ as follows.

$$P_\alpha(Y) := P(\mathrm{MP}(f_Y, T_Y) = \{\alpha\} \mid y = \alpha),$$
$$P_\beta(Y) := P(\mathrm{MP}(f_Y, T_Y) = \{\beta\} \mid y = \alpha),$$
$$P_{\alpha\beta}(Y) := P(\mathrm{MP}(f_Y, T_Y) = \{\alpha, \beta\} \mid y = \alpha).$$



Let $T_n$ be a balanced binary tree of depth $n$, i.e., with $n$ edges on the path from the
root to each leaf. Let $X$ be its leaf set. Suppose that the substitution probability on
each edge of $T_n$ is $q$. For this particular symmetric tree, we denote $P_\alpha(X)$, $P_\beta(X)$ and
$P_{\alpha\beta}(X)$ by $P_\alpha(n, q)$, $P_\beta(n, q)$ and $P_{\alpha\beta}(n, q)$, respectively. The convergence properties of
these probabilities (for $n \to \infty$ and for various values of $q$) have been studied in detail in
(Steel and Charleston, 1995) and (Yang, 2008). We state their result on the convergence
of $P_\alpha(n, q)$ that additionally provides a lower bound on $P_\alpha(n, q)$ which is independent of
$n$.



Lemma 1 (Steel and Charleston, 1995; Yang, 2008). Let $T_n$ be a binary balanced
phylogenetic tree of depth $n > 2$. Let $q < 1/8$ be the probability of substitution on each edge of
the tree. Then $P_\alpha(n, q)$ approaches

$$\frac{1}{2}\left(1 - \frac{2q}{1 - 2q} + \frac{\sqrt{(1 - 8q)(1 - 4q)}}{(1 - 2q)^2}\right)$$

from above as $n \to \infty$. Moreover, as $q$ goes to 0, the above limiting value approaches 1.

Proof of Theorem 1. Let $T$ be a phylogenetic tree rooted at $\rho$ constructed as follows. The
left subtree of $T$ contains a single leaf $z$. The right subtree of $T$ is $T_Y$ with leaf set $Y$
and root $y$. Therefore, the leaf set of $T$ is $X = Y \cup \{z\}$. We choose $T_Y$ to be a balanced
binary tree of depth $n$ with substitution probability $q$ on each edge. Let the substitution
probabilities on $(\rho, z)$ and $(\rho, y)$ be $p_z$ and $p_y$, respectively, where $p_z$ is any given real
number such that $0 < p_z < 1/2$. (See an illustration of these parameters in Figure 1.)
For the above tree, the reconstruction accuracy on $X$ is given by

$$\begin{aligned}
RA(X) = {} & (1 - p_z)\big((1 - p_y)P_\alpha(n, q) + p_y P_\beta(n, q) + P_{\alpha\beta}(n, q)\big) \\
& + \tfrac{1}{2}\, p_z \big((1 - p_y)P_\alpha(n, q) + p_y P_\beta(n, q)\big) \\
& + \tfrac{1}{2}\, (1 - p_z)\big(p_y P_\alpha(n, q) + (1 - p_y)P_\beta(n, q)\big).
\end{aligned}$$

The reconstruction accuracy on $Y$ is given by

$$RA(Y) = (1 - p_y)P_\alpha(n, q) + p_y P_\beta(n, q) + \tfrac{1}{2} P_{\alpha\beta}(n, q).$$

In order to satisfy $RA(Y) > RA(X)$, we therefore must have

$$(p_z - p_y)P_\alpha(n, q) > (1 - 2p_z)P_{\alpha\beta}(n, q) + (1 - p_z - p_y)P_\beta(n, q). \tag{2}$$



We now show that for any value of $p_z$, however small, the remaining substitution
probabilities $q$ and $p_y$ and the depth $n$ of $T_Y$ can be chosen such that $RA(Y) > RA(X)$
(condition 2 in Theorem 1), and for every vertex $v$ in $Y$, the probability of a change of
state from the root to $v$ is more than $p_z$ (condition 3 in Theorem 1).

We express the third condition in Theorem 1 in a different form. Let $Q := 1 - 2q$,
$P_z := 1 - 2p_z$ and $P_y := 1 - 2p_y$. Since the tree $T_n$ is symmetric, the probability of a
change of state from the root to any leaf $v$ in $Y$ is the same, and is given by $p_v = \frac{1}{2} - \frac{P_y Q^n}{2}$.
Therefore, the third condition may now be written as $P_y Q^n < P_z$, or equivalently as

$$(1 - 2q)^n < \frac{P_z}{P_y}. \tag{3}$$
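
The expression for $p_v$ uses the fact that, in the two-state symmetric model, the quantities $1 - 2p_e$ multiply along a path. A minimal Python sketch of this composition rule follows; the function name `path_substitution_probability` is an assumption made for illustration.

```python
def path_substitution_probability(edge_probs):
    """Probability that the two end vertices of a path are in different
    states, given the substitution probability on each edge of the path."""
    fidelity = 1.0
    for p in edge_probs:
        # Each edge contributes a factor 1 - 2p.
        fidelity *= 1.0 - 2.0 * p
    return (1.0 - fidelity) / 2.0


# Two edges with probabilities 0.1 and 0.2: exactly one of them must change
# the state, which happens with probability 0.1 * 0.8 + 0.9 * 0.2.
print(path_substitution_probability([0.1, 0.2]))

# Root-to-leaf probability in the construction: one edge with p_y, then n
# edges with q each.
p_y, q, n = 0.05, 0.01, 10
print(path_substitution_probability([p_y] + [q] * n))
```

With edge list `[p_y] + [q] * n` the function evaluates $(1 - P_y Q^n)/2$, the quantity compared with $p_z$ in condition 3.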

It follows from Lemma 1 that, for all $n > 2$, as $q$ approaches 0, the left hand side of
Equation (2) approaches $p_z - p_y$ and the right hand side approaches 0. Therefore, there
is a real number $\epsilon$ such that $0 < \epsilon < 1/8$, and whenever $q < \epsilon$, Equation (2) is satisfied.
Now given a value of $p_z$, we first arbitrarily fix $p_y$ such that $0 < p_y < p_z$, and then fix
a value of $H := (1 - 2q)^n$ satisfying the constraint in Equation (3). We then choose $n$
sufficiently large so that $q = (1 - H^{1/n})/2 < \epsilon$ and the constraint given in Equation (2) is
satisfied as well. This completes the proof. □
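
The construction in the proof can be tried out numerically. The following Python sketch evaluates the expressions for $RA(X)$ and $RA(Y)$ given in the proof for one choice of parameters. The values $p_z = 0.1$, $p_y = 0.05$, $H = 0.8$ and $n = 20$ are illustrative assumptions and are not taken from the paper, and the function names are hypothetical.

```python
def balanced_probs(n, q):
    """(P_alpha, P_beta, P_alphabeta) on the balanced tree of depth n with
    substitution probability q on every edge, root in state alpha."""
    a, b, c = 1.0, 0.0, 0.0
    for _ in range(n):
        up_a = (1.0 - q) * a + q * b
        up_b = q * a + (1.0 - q) * b
        a, b, c = (up_a * up_a + 2.0 * up_a * c,
                   up_b * up_b + 2.0 * up_b * c,
                   c * c + 2.0 * up_a * up_b)
    return a, b, c


def ra_all_and_subset(p_z, p_y, n, q):
    """Return (RA(X), RA(Y)) for the tree used in the proof of Theorem 1."""
    a, b, c = balanced_probs(n, q)
    # Probability that the Fitch set at y is {alpha} (resp. {beta}),
    # conditional on the root rho being in state alpha.
    set_a = (1.0 - p_y) * a + p_y * b
    set_b = p_y * a + (1.0 - p_y) * b
    ra_x = ((1.0 - p_z) * (set_a + c)
            + 0.5 * p_z * set_a
            + 0.5 * (1.0 - p_z) * set_b)
    ra_y = set_a + 0.5 * c
    return ra_x, ra_y


p_z, p_y, H, n = 0.1, 0.05, 0.8, 20     # H < (1 - 2 p_z) / (1 - 2 p_y)
q = (1.0 - H ** (1.0 / n)) / 2.0        # so that (1 - 2q)^n = H
p_v = (1.0 - (1.0 - 2.0 * p_y) * H) / 2.0

ra_x, ra_y = ra_all_and_subset(p_z, p_y, n, q)
print("q =", q, "p_v =", p_v)
print("RA(X) =", ra_x, "RA(Y) =", ra_y)
```

For these parameters every leaf of $Y$ is farther from the root than $z$ (since $p_v > p_z$), yet ignoring $z$ gives the higher reconstruction accuracy.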

Note that when $q \geq 1/8$, the sequence $P_\alpha(n, q)$ has quite different convergence
properties than when $q < 1/8$, and the bound provided by Lemma 1 does not apply (see
Steel and Charleston, 1995; Yang, 2008, for details). Therefore, our construction of a
misleading taxon given in the proof of Theorem 1 strongly depends on $q$ being sufficiently
small.

A Single Taxon Under a Molecular Clock 

In this section, we consider binary characters on a binary phylogenetic tree $T$ with leaf
set $X$ under a molecular clock and the two-state symmetric model introduced earlier. Let
$p$ be the probability that a leaf is in a different state than the root. Therefore, if we were
to guess the root state by looking at only one taxon, the probability of success would be
the probability that the root state was conserved at this taxon, which is $1 - p$. That is,
if $Y = \{x_1\}$ is a single taxon subset of $X$, then $RA(Y) = 1 - p$. In the following, we
show that $1 - p$ is in fact a lower bound on $RA(X)$, implying that MP applied to all taxa
reconstructs the root state at least as successfully as reconstructing the root state from a
single taxon.

As shown in Figure 2, we denote the children of $\rho$ by $y_1$ and $y_2$, and define $T_i$ to be
the subtree rooted at $y_i$, with leaf set $Y_i$, for $i$ in $\{1, 2\}$. Let the probability of a change of
state from $\rho$ to $y_i$ be $p_i$, and let the probability of a change of state from $y_i$ to any leaf
under $y_i$ be $p'_i$. For $i$ in $\{1, 2\}$, we define $P_i := 1 - 2p_i$. Similarly we define $P := 1 - 2p$.

In the above notation, we prove the following lower bound on $RA(X)$.

Theorem 2. For any rooted binary phylogenetic ultrametric (clock-like) tree $T$ with leaf
set $X$, the reconstruction accuracy of MP is at least equal to the conservation probability
from the root to any leaf, that is,

$$RA(X) \geq 1 - p.$$

Proof. We first state two recursions, which we then use to give an inductive proof of the
theorem.







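$$\begin{aligned}
P_\alpha(X) = {} & \left(\tfrac{1 + P_1}{2} P_\alpha(Y_1) + \tfrac{1 - P_1}{2} P_\beta(Y_1)\right)\left(\tfrac{1 + P_2}{2} P_\alpha(Y_2) + \tfrac{1 - P_2}{2} P_\beta(Y_2)\right) \\
& + \left(\tfrac{1 + P_1}{2} P_\alpha(Y_1) + \tfrac{1 - P_1}{2} P_\beta(Y_1)\right) P_{\alpha\beta}(Y_2) \\
& + \left(\tfrac{1 + P_2}{2} P_\alpha(Y_2) + \tfrac{1 - P_2}{2} P_\beta(Y_2)\right) P_{\alpha\beta}(Y_1), \\
P_{\alpha\beta}(X) = {} & P_{\alpha\beta}(Y_1) P_{\alpha\beta}(Y_2) \\
& + \left(\tfrac{1 + P_1}{2} P_\alpha(Y_1) + \tfrac{1 - P_1}{2} P_\beta(Y_1)\right)\left(\tfrac{1 - P_2}{2} P_\alpha(Y_2) + \tfrac{1 + P_2}{2} P_\beta(Y_2)\right) \\
& + \left(\tfrac{1 - P_1}{2} P_\alpha(Y_1) + \tfrac{1 + P_1}{2} P_\beta(Y_1)\right)\left(\tfrac{1 + P_2}{2} P_\alpha(Y_2) + \tfrac{1 - P_2}{2} P_\beta(Y_2)\right).
\end{aligned}$$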
We define $D(X) := P_\alpha(X) + P_{\alpha\beta}(X)/2 - (1 + P)/2$, and similarly we define $D_1 := D(Y_1)$
and $D_2 := D(Y_2)$ (with $1 - 2p'_i$ in place of $P$ for the subtree $T_i$). The above recursions can be
manipulated with a computer algebra system to verify that

$$4D(X) = 2P_{\alpha\beta}(Y_1)D_2P_2 + 2P_{\alpha\beta}(Y_2)D_1P_1 + 2D_2P_2 + 2D_1P_1 + P_{\alpha\beta}(Y_1)P + P_{\alpha\beta}(Y_2)P.$$

Now, by induction on the number of leaves, we show that $D(X)$ is non-negative. The
base case of the inductive proof is when $Y_1$ and $Y_2$ are singleton sets, in which case $D(Y_1)$,
$D(Y_2)$ and $D(X)$ are all equal to 0, that is, $RA(X)$ is $1 - p$. Suppose that the tree $T$ has
$n$ taxa, and suppose that $D(X)$ is non-negative for all trees having fewer than $n$ taxa.
Since both $Y_1$ and $Y_2$ contain fewer than $n$ taxa, $D(Y_1)$ and $D(Y_2)$ are both non-negative.
Since $P_{\alpha\beta}(Y_1)$, $P_{\alpha\beta}(Y_2)$, $P_1$, $P_2$ and $P$ are all non-negative, the right hand side of the above
equation is non-negative, implying the theorem. □
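
The identity for $4D(X)$ and the resulting bound can also be checked numerically on randomly generated clock-like trees. The sketch below is an illustration under stated assumptions: it uses the standard Fitch combination rule at the root written in our own notation, represents a subtree by the tuple $(P_\alpha, P_\beta, P_{\alpha\beta}, 1 - 2p)$, and draws random trees with a hypothetical helper `random_clock_tree`. None of these names come from the paper.

```python
import random


def leaf():
    """A single taxon: its state is known, and 1 - 2p = 1 because p = 0."""
    return (1.0, 0.0, 0.0, 1.0)


def lift(sub, P_edge):
    """Fitch set probabilities of a child subtree seen from its parent,
    where P_edge = 1 - 2 * (substitution probability on the parent edge)."""
    a, b, c, _ = sub
    A = (1 + P_edge) / 2 * a + (1 - P_edge) / 2 * b
    B = (1 - P_edge) / 2 * a + (1 + P_edge) / 2 * b
    return A, B, c


def join(t1, t2, P):
    """Join two clock-like subtrees below a new root whose root-to-leaf
    value of 1 - 2p equals P (requires P <= t1[3] and P <= t2[3])."""
    A1, B1, C1 = lift(t1, P / t1[3])
    A2, B2, C2 = lift(t2, P / t2[3])
    return (A1 * A2 + A1 * C2 + A2 * C1,
            B1 * B2 + B1 * C2 + B2 * C1,
            C1 * C2 + A1 * B2 + A2 * B1,
            P)


def D(t):
    """D = RA - (1 - p), using (1 + P) / 2 = 1 - p."""
    a, _, c, P = t
    return a + c / 2 - (1 + P) / 2


def identity_gap(t1, t2, P):
    """Absolute difference between the two sides of the identity for 4 D(X)."""
    P1, P2 = P / t1[3], P / t2[3]
    c1, c2, D1, D2 = t1[2], t2[2], D(t1), D(t2)
    rhs = (2 * c1 * D2 * P2 + 2 * c2 * D1 * P1
           + 2 * D2 * P2 + 2 * D1 * P1 + c1 * P + c2 * P)
    return abs(4 * D(join(t1, t2, P)) - rhs)


def random_clock_tree(n_leaves, rng):
    """Return (tree tuple, largest identity gap seen while building it)."""
    if n_leaves == 1:
        return leaf(), 0.0
    k = rng.randint(1, n_leaves - 1)
    t1, g1 = random_clock_tree(k, rng)
    t2, g2 = random_clock_tree(n_leaves - k, rng)
    P = min(t1[3], t2[3]) * rng.uniform(0.5, 1.0)
    return join(t1, t2, P), max(g1, g2, identity_gap(t1, t2, P))


rng = random.Random(0)
results = [random_clock_tree(rng.randint(2, 12), rng) for _ in range(200)]
print("smallest D(X):", min(D(t) for t, _ in results))
print("largest identity gap:", max(g for _, g in results))
```

Up to floating-point error, the smallest value of $D(X)$ should be non-negative and the identity gap should vanish, in agreement with Theorem 2.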



Discussion 



In this paper we analyzed the question of how MP performs when used to reconstruct the
ancestral root state. In particular, we considered the problem for phylogenetic trees on
which the probability of a change of state from the root vertex to any leaf is constant.
Earlier simulation studies (e.g., Salisbury and Kim, 2001; Zhang and Nei, 1997) suggested
that the reconstruction accuracy is generally increased when more taxa are considered.
But simulations conducted by Li, Steel and Zhang showed that even under a molecular 
clock, MP may perform better on certain subsets of taxa. We present an example of a tree 
in which one of the subtrees at the root consists of a single leaf and a pendant edge, and
the other subtree is a balanced binary tree of large depth and small (< 1/8) substitution 
probabilities on all edges. On this tree, we observed that the ancestral state reconstruction 
is more accurate if only the set of taxa on the balanced subtree is considered. This is in 
contrast to the example given by Li, Steel and Zhang in which an outgroup taxon closer 
to the root or a single fossil record may give a better estimate of the root state than 
considering the whole tree. As our example shows, even a very short edge connecting the 
root with a leaf cannot guarantee an accurate root state estimation if the remaining taxa
induce a balanced tree with a large number of taxa. For such trees, it may be better
to ignore the fossil or a taxon closer to the root. Thus there seems to be no general
theoretical guideline to decide what subsets of taxa are to be used for a more reliable
reconstruction of the root state. In general we believe that very long leaf edges would
have an adverse effect on the ancestral state reconstruction using MP, but it would be 
useful to quantitatively or algorithmically state and prove such an expectation. 

While using the data on a subset of taxa may give a more accurate estimate of the 
root state, in general a single taxon subset does not give a better reconstruction accuracy. 
We showed this by resolving a conjecture of Li, Steel and Zhang. They conjectured that 
for two state characters on an ultrametric (clock-like) tree and a symmetric model of 
substitution, ancestral state reconstruction using all taxa is at least as accurate as that 
using a single taxon. We expect such a result to be true even when there are more than 
two states. 
Figure captions:

Figure 1: A tree on which MP is more accurate when applied to $Y \subset X$.

Figure 2: Illustration for Theorem 2. For any clocklike binary phylogenetic tree $T$ the
reconstruction accuracy of MP based on all leaves is at least as good as the one based on
a single leaf.